fix(engine): centralize telemetry timer management in runtime manager #2804
Conversation
Fixes open-telemetry#1305

Previously, every node (receiver, processor, exporter) independently called `effect_handler.start_periodic_telemetry(Duration::from_secs(1))` with a hardcoded 1-second interval. This was:

- Not configurable by operators
- Not enforceable (each node picked its own interval)
- A significant contributor to idle CPU (~50 millicores on 4 cores)

The runtime manager now registers telemetry timers for all nodes centrally during pipeline startup, using the configured `engine.telemetry.reporting_interval`. This:

- Removes `start_periodic_telemetry` calls from all 15 node files
- Eliminates per-node cancel handle management on shutdown
- Enforces a single, consistent collection cadence by construction
- Uses the existing configurable `reporting_interval` (default 1s)

The idle test configuration is updated to use `reporting_interval: 5s` and a matching 5s Prometheus scrape interval, reducing idle CPU from ~0.9% to ~0.1% on 4 cores. Also fixes the idle-state-template Prometheus endpoint URLs to use the correct `/api/v1` prefix.
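To make the "registered centrally during pipeline startup" idea concrete, here is a minimal sketch of one-timer-per-node registration driven by a single configured interval. All type, field, and node names below are illustrative stand-ins, not the actual otap-dataflow API:

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical runtime-manager state: one telemetry timer entry per node,
/// all created from the single configured reporting interval.
struct TelemetryTimers {
    interval: Duration,
    // node id -> timer period (a stand-in for a real timer wheel entry)
    timers: HashMap<String, Duration>,
}

impl TelemetryTimers {
    /// Register a telemetry timer for every node at pipeline startup,
    /// instead of each node calling start_periodic_telemetry itself.
    fn register_all(interval: Duration, node_ids: &[&str]) -> Self {
        let timers = node_ids
            .iter()
            .map(|id| (id.to_string(), interval))
            .collect();
        TelemetryTimers { interval, timers }
    }
}

fn main() {
    let timers = TelemetryTimers::register_all(
        Duration::from_secs(5),
        &["otlp_receiver", "batch_processor", "otlp_exporter"],
    );
    // Every node shares the same configured cadence by construction.
    assert!(timers.timers.values().all(|d| *d == timers.interval));
    println!("{}", timers.timers.len());
}
```

Because the manager constructs every entry from the same `interval` value, a node cannot opt into its own cadence; that is what "enforced by construction" buys over the old per-node calls.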
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##            main    #2804     +/-  ##
==========================================
- Coverage  86.03%   86.01%   -0.02%
==========================================
  Files        720      720
  Lines     273338   273249      -89
==========================================
- Hits      235168   235048     -120
- Misses     37646    37677      +31
  Partials     524      524
```
Force-pushed from 767831c to dbdd6d7.
Pull request overview
This PR centralizes periodic node telemetry scheduling inside the engine runtime-control manager so pipelines use the configured engine-wide reporting interval instead of having each node start and cancel its own telemetry timer. It also updates the idle perf test to use a slower 5s telemetry/scrape cadence that better matches the new centralized behavior.
Changes:
- Pre-register telemetry timers for all nodes in the runtime-control manager and sync control-plane timer metrics immediately.
- Remove per-node `start_periodic_telemetry()` / cancel-handle management from receivers, processors, exporters, and validation code.
- Add a dedicated idle perf-test engine config with `engine.telemetry.reporting_interval: 5s` and align Prometheus scraping to 5s.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `tools/pipeline_perf_test/test_suites/integration/templates/configs/engine/continuous/otlp-attr-otlp-idle.yaml` | New idle perf-test engine config with 5s telemetry interval. |
| `tools/pipeline_perf_test/test_suites/integration/continuous/idle-state-template.yaml.j2` | Switches idle test to the new config and sets 5s scrape interval. |
| `rust/otap-dataflow/crates/validation/src/validation_exporter.rs` | Removes exporter-local telemetry timer startup. |
| `rust/otap-dataflow/crates/engine/src/processor.rs` | Removes processor wrapper telemetry timer lifecycle. |
| `rust/otap-dataflow/crates/engine/src/pipeline_ctrl.rs` | Centralizes telemetry timer registration in the runtime manager and updates tests. |
| `rust/otap-dataflow/crates/engine/src/control.rs` | Adds helper to enumerate registered node IDs. |
| `rust/otap-dataflow/crates/core-nodes/src/receivers/topic_receiver/mod.rs` | Removes topic receiver telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/core-nodes/src/receivers/syslog_cef_receiver/mod.rs` | Removes syslog receiver telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/core-nodes/src/receivers/otlp_receiver/mod.rs` | Removes OTLP receiver telemetry timer/cancel plumbing. |
| `rust/otap-dataflow/crates/core-nodes/src/receivers/otap_receiver/mod.rs` | Removes OTAP receiver telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/core-nodes/src/receivers/internal_telemetry_receiver/mod.rs` | Removes internal telemetry receiver timer startup. |
| `rust/otap-dataflow/crates/core-nodes/src/receivers/fake_data_generator/mod.rs` | Removes fake generator telemetry timer startup. |
| `rust/otap-dataflow/crates/core-nodes/src/exporters/topic_exporter/mod.rs` | Removes topic exporter telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/core-nodes/src/exporters/perf_exporter/mod.rs` | Removes perf exporter custom telemetry timer startup. |
| `rust/otap-dataflow/crates/core-nodes/src/exporters/parquet_exporter/mod.rs` | Removes parquet exporter telemetry timer management and related test. |
| `rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_http_exporter/mod.rs` | Removes OTLP/HTTP exporter telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/core-nodes/src/exporters/otlp_grpc_exporter/mod.rs` | Removes OTLP/gRPC exporter telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/core-nodes/src/exporters/otap_exporter/mod.rs` | Removes OTAP exporter telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/contrib-nodes/src/exporters/geneva_exporter/mod.rs` | Removes Geneva exporter telemetry timer/cancel handling. |
| `rust/otap-dataflow/crates/contrib-nodes/src/exporters/azure_monitor_exporter/exporter.rs` | Removes Azure Monitor exporter telemetry timer/cancel handling. |
Force-pushed from aa267ca to 752c024.
A zero-duration `reporting_interval` would cause telemetry timers to reschedule at the same instant repeatedly, spinning the runtime control loop. Validate that the interval is non-zero during config validation so the error is caught early with a clear message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
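The validation described in this commit could look roughly like the following; the function name and error wording are illustrative, not the actual crate API:

```rust
use std::time::Duration;

/// Sketch of the early config check: reject a zero reporting interval so
/// the control loop cannot spin rescheduling timers at the same instant.
fn validate_reporting_interval(interval: Duration) -> Result<(), String> {
    if interval.is_zero() {
        return Err(
            "engine.telemetry.reporting_interval must be greater than zero".to_string(),
        );
    }
    Ok(())
}

fn main() {
    // A zero interval is rejected at config-validation time...
    assert!(validate_reporting_interval(Duration::ZERO).is_err());
    // ...while any positive interval passes.
    assert!(validate_reporting_interval(Duration::from_secs(5)).is_ok());
    println!("ok");
}
```

Failing at validation time gives the operator a clear message up front, instead of a silently busy-looping control thread at runtime.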
Force-pushed from 752c024 to c087b78.
…try-timers

# Conflicts:
#	rust/otap-dataflow/crates/core-nodes/src/exporters/otap_exporter/mod.rs
#	rust/otap-dataflow/crates/engine/src/processor.rs
#	tools/pipeline_perf_test/test_suites/integration/continuous/idle-state-template.yaml.j2
This node was added while the PR was in flight and still called `start_periodic_telemetry(1s)` directly, bypassing the centralized timer registration.
```diff
 use tokio::time::timeout;

-const TEST_CONTROL_PLANE_METRICS_FLUSH_INTERVAL: Duration = Duration::from_millis(10);
+const TEST_CONTROL_PLANE_METRICS_FLUSH_INTERVAL: Duration = Duration::from_secs(3600);
```
This is the per-node CollectTelemetry timer interval (a new constant, distinct from TEST_CONTROL_PLANE_METRICS_FLUSH_INTERVAL). Since timers are now pre-registered in the constructor, a low value floods the control channels with CollectTelemetry messages during tests, causing 8 test failures. Setting it to 3600s keeps the timers dormant; tests that need telemetry drive it manually.
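A tiny illustration of the dormant-interval trick (the constant name here is made up, not the one in the diff):

```rust
use std::time::Duration;

// Illustrative only: with per-node telemetry timers pre-registered in the
// constructor, tests pick a cadence long enough that the timers never fire
// during a test run, and trigger collection explicitly when needed.
const TEST_NODE_TELEMETRY_INTERVAL: Duration = Duration::from_secs(3600);

fn main() {
    // A millisecond-scale cadence would flood bounded control channels with
    // CollectTelemetry messages during tests; an hour keeps them quiet.
    assert!(TEST_NODE_TELEMETRY_INTERVAL > Duration::from_millis(10));
    println!("dormant");
}
```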
**lquerel** left a comment
Thank you for cleaning up this part of our internal telemetry reporting!
Fixes #1305

Motivation

Our idle perf tests were showing surprisingly high CPU usage. Investigation revealed that each node was independently calling `start_periodic_telemetry(Duration::from_secs(1))`, triggering internal metric collection (syscalls like `getrusage`, jemalloc stats, tokio worker metrics, channel snapshots) every second across all nodes. At 1s intervals, the telemetry overhead itself dominated idle CPU measurements, producing misleading results.

Changes

Centralizes telemetry timer registration in the runtime control manager so all nodes use the configured `engine.telemetry.reporting_interval` instead of each node managing its own timer independently. This removes ~250 lines of per-node boilerplate across 15 nodes, with shutdown cancellation handled centrally.

Also bumps the idle perf test to `reporting_interval: 5s` (with a matching 5s Prometheus scrape interval) for a more realistic deployment baseline. With centralized timing and the 5s interval, idle CPU numbers look significantly better: engine CPU drops ~2.3x and pipeline CPU drops ~6.4x compared to the previous 1s configuration.

Adds validation rejecting a zero-duration `reporting_interval` to prevent accidental spin loops.

Notes

- `perf_exporter.config.frequency` is now silently ignored; deprecation/cleanup is a follow-up.
- The `start_periodic_telemetry` API is still public; TBD whether to remove it.
- An eager-timer-registration race flagged in review remains in `pipeline_ctrl.rs` (telemetry can queue ahead of `Shutdown` in a slow-starting node's bounded control channel); to be addressed via a node-ready signal in a follow-up.